Murders in Poland 1999-2020. Data Analysis in Python.¶

Data Engineering and Big Data | Warsaw School of Economics (SGH)
Postgraduate Studies Project

Contents¶

  1. The scope of analysis and main conclusions
  2. Part 1: Analysis of murders incidents
    1. Data loading and structure verification
    2. Calculating percantage share
    3. Data grouping and verification
    4. Overal analysis per province
    5. Analysis of provinces share rate
    6. Analysis of maximal/minimal values and crime intensity
    7. Detection rate analysis
    8. Differences between 2019 and 2020
  3. Part 2: Correlation analysis for 2020
    1. Data loading and structure verification
    2. Data segmentation
    3. Relationship analysis: murders, population, province area
    4. Calculation of the murders crime rate
    5. Correlation analysis with chosen variables

The scope of analysis and main conclusions¶


The purpose of the reserach is the analysis of data presenting number of murders across Poland between 1999 and 2020 (included), broken down by provinces and compared with data related to population and area of the provinces in 2020. The data is extracted from two sources: Police Statistics and Central Statistical Office.

The research is composed of two parts. Part 1 is focused on murder crime trend analysis over the years by provinces, and leads to the recognition of frequency and volume of crimes within particular locations. On top of that, observation dedicated to analyzing crime intensity in time and identification of periods with the highest and lowest rate levels. Additional metric taken under consideration is the general detection rate level and its values over the years. The last component covers comparison between 2019 and 2020, so that key differences can presented and examined. Part 2 enables widening the perspective by adding data about population and provinces areas and exploring correlation by identyfing coefficient relationship between variables. This part covers oly 2020 for all data included in the correlation test. Basing on the results, the relationdhip between key variables is visualised.


Definitions applied in analysis:

- administrative unit: provinces of Poland, e.g. Silesia. - crime registered: number of murders within particular province; followed with official procedures by investigative authorities. - crime detected: number of crimes (among registered) where at least one suspect is appointed. - detection rate: number of crimes detected divided by crimes registered and multiplied by 100, for percantage calculation.


Questions to answer:

  • [1] What is the general trend of registered crimes across Poland within last 22 years?
  • [2] Which provinces note the greatest number of murders? What is the share of each in Poland?
  • [3] What is the crime detection effectiveness? Does it improve over time?
  • [4] When the greatest and the lowest numbers of murder crime incidents are noted?
  • [5] Is there any remarkable difference between 2019 and 2020? Which of the provinces stand out and how?
  • [6] Are there relationships between murders and the characteristics of particular province in 2020? What correlations are observed?

Conclusions:

Crime level of the incidents classified as murders decreases with each year. When looking at the provinces with the geatest volume of murders, four of them focus attention. These are: Masovia, Silesian, Lower Silesian, Lodzkie. The first three of them represent the majority of overall share between 1999-2020, where the exact percentage rates are: Masovia (15,8%), Silesian (12,9%), Lower Silesian (9,9%). On the other hand, the lowest records are characteristic for Opole (2%), Subcarpathian (3%), Świętokrzyskie (3%), Podlaskie (3%). Interesting fact: Masovia notes the highest number of murders every year. Unlike Opolskie, which stands for the lowest results consistently.

Questions [1] and [2]

Described results are not equal to high or low level of crime generally. Each case needs to be considered individually, due to local characteristics related to population and area. As the main indicator, the crime rate metric is analzyed as even when the number of crimes itself is lower, the crime rate may be higher once comparing different locations.

Detection of crimes grows significantly while number of crimes decreases. Within last 22 years, the general detection rate increases by ar.12 percentage points, from 86% to 98% countrywide. The highest records are noted in 2000 and 2001 (accordingly: 1270, 1325). On the contrary, in 2016 notes the lowest record of 456 murders. This also confirm the overall decline that happens over the years.

Questions [3] and [4]

Nevertheless, there is remarkable growth in the number of murders in 2020 once comparing to 2019. In some locations the number is twice as big. As example, it is worth mentioning Masovia (+109%), Lubusz (+100%), Kuyavian-Pomeranian (+80%). The opposite situation is observed in Subcarpathian where quite spectacular decline is noted - by 50% in 2020 vs 2019. This phenomenon can be dictated by pandemic and strict rules of health control and public safety followed around the world, what led to minimized crime. In order to confirm such hypothesis, more complex and deepen analysis would be required in order to include additional factors and data volume.

Question [5]

In Part 2 of the analytical work, enriched with two variables: population and area of each province in 2020, there are some relationships identified. Facotr like big population (corr. 0.8) and population density (corr 0.52) seem to present positive relationship with the number of murders. The provinces with the highest number are Masovia, Silesian. Meanwhile, the ones with the highest crime density (number of crimes per resident) are Lubusz, Lower Silesian, West Pomerania and these are classified as more dangerous.

Question [6]

Second interesting thing is quite negative correlation between crime detection rate and population (-0.37); similarly with the area of the province (-0.32). The bigger the population is, the more challenging it is to identify the suspects. Within provinces with bigger area, there is also a better way to hide or escape. Next to this, there are also provinces with high detection rate as a rule, represented by: Kuyavian-Pomeranian, Opole, Lubusz. These are locations with significantly small area and higher crime density. These factors can also contribute to higher detection than in other locations on average.

Question [6]

Part 1: Analysis of murder incidents¶

Data loading and structure verification¶

---- Back to top ----

InĀ [1]:
# Importing needed libraries in order to make use of them in the further analysis and visualization. 

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
InĀ [2]:
# Openning my data file from the local catalogue.

crime = pd.read_csv('PATH/crime-murder-1999-2020.csv', sep=',')
InĀ [3]:
# Checking crime types in order to verify whether no update is required.

crime.dtypes
Out[3]:
administrative_unit    object
year                    int64
crime_registered        int64
crime_detected          int64
dtype: object
InĀ [4]:
# Changing crime type for column 'year' as it is not needed for calculations, meanwhile required as a string value.

crime['year'] = crime['year'].astype('object')
crime.dtypes
Out[4]:
administrative_unit    object
year                   object
crime_registered        int64
crime_detected          int64
dtype: object
InĀ [5]:
# Checking table structure by viewing rows and columns number.

crime.shape
Out[5]:
(396, 4)
InĀ [6]:
# Analyzing crime structure by calling the columns names. Additionally, previewing several first rows.

crime.columns
Out[6]:
Index(['administrative_unit', 'year', 'crime_registered', 'crime_detected'], dtype='object')
InĀ [7]:
crime.head()
Out[7]:
administrative_unit year crime_registered crime_detected
0 Lower Silesian 2020 69 68
1 Lower Silesian 2019 33 33
2 Lower Silesian 2018 59 55
3 Lower Silesian 2017 52 51
4 Lower Silesian 2016 53 53
InĀ [8]:
# Displaying maximum, minimum and average values; verifying whether records count is the same per columns.

crime.describe()
Out[8]:
crime_registered crime_detected
count 396.000000 396.000000
mean 49.959596 45.926768
std 35.763901 30.785862
min 4.000000 5.000000
25% 25.750000 24.000000
50% 41.000000 39.000000
75% 64.250000 59.000000
max 233.000000 179.000000
InĀ [9]:
# Checking for null values in order to clean / update / replace / delete.

crime.isnull().values.any()
Out[9]:
False
InĀ [10]:
# Displaying unique values from the first column (administrative_unit) in order to see all of the locations.

crime['administrative_unit'].unique()
Out[10]:
array(['Lower Silesian', 'Kuyavian-Pomeranian', 'Lublin', 'Lubusz',
       'Lodzkie', 'Lesser Poland',
       'Masovia (KWP z/s w Radomiu i KSP Warszawa)', 'Opole',
       'Subcarpathian', 'Podlaskie', 'Pomerania', 'Silesian',
       'Świętokrzyskie', 'Warmian-Masurian', 'Greater Poland',
       'West Pomerania', 'teren działania KWP z/s w Radomiu',
       'teren działania KSP Warszawa'], dtype=object)
InĀ [11]:
# Updating naming of the Masovia province so that naming is kept consistent with the rest from the set.

crime.loc[crime['administrative_unit'] == 'Masovia (KWP z/s w Radomiu i KSP Warszawa)', 'administrative_unit'] = 'Masovia'
InĀ [12]:
# Additionally, I eliminate data from two locations ('teren działania KWP z/s w Radomiu' and 'teren działania KSP Warszawa').
# These are already included in summarized data for Masovia, no need to duplicate it. 
# Saving the data within a new file so that I can keep the set locally.

crime = crime[~crime['administrative_unit'].isin(['teren działania KWP z/s w Radomiu', 'teren działania KSP Warszawa'])]
crime.to_csv('PATH/crime-murder-UPDATED.csv', index = False)
InĀ [13]:
crime['administrative_unit'].unique()
Out[13]:
array(['Lower Silesian', 'Kuyavian-Pomeranian', 'Lublin', 'Lubusz',
       'Lodzkie', 'Lesser Poland', 'Masovia', 'Opole', 'Subcarpathian',
       'Podlaskie', 'Pomerania', 'Silesian', 'Świętokrzyskie',
       'Warmian-Masurian', 'Greater Poland', 'West Pomerania'],
      dtype=object)

Calculating percentage share¶

---- Back to top ----

InĀ [14]:
# Adding column with crime detection calculation (percentage) and adjusting decimal points.
# Updating the source file (UPDATED version) and checking its structure afterwards.

crime['detection_rate'] = (crime['crime_detected']/crime['crime_registered']*100).round()
crime.to_csv('PATH/crime-murder-UPDATED.csv', index=False)
crime.head()
Out[14]:
administrative_unit year crime_registered crime_detected detection_rate
0 Lower Silesian 2020 69 68 99.0
1 Lower Silesian 2019 33 33 100.0
2 Lower Silesian 2018 59 55 93.0
3 Lower Silesian 2017 52 51 98.0
4 Lower Silesian 2016 53 53 100.0
InĀ [15]:
# Checking crime types after the updates. The newly added one is used later in calculations.

crime.dtypes
Out[15]:
administrative_unit     object
year                    object
crime_registered         int64
crime_detected           int64
detection_rate         float64
dtype: object
InĀ [16]:
# Checking data for different provinces, for overall preview of the details.

crime.loc[crime['administrative_unit'] == 'Lower Silesian']
Out[16]:
administrative_unit year crime_registered crime_detected detection_rate
0 Lower Silesian 2020 69 68 99.0
1 Lower Silesian 2019 33 33 100.0
2 Lower Silesian 2018 59 55 93.0
3 Lower Silesian 2017 52 51 98.0
4 Lower Silesian 2016 53 53 100.0
5 Lower Silesian 2015 51 46 90.0
6 Lower Silesian 2014 41 40 98.0
7 Lower Silesian 2013 70 69 99.0
8 Lower Silesian 2012 62 58 94.0
9 Lower Silesian 2011 61 58 95.0
10 Lower Silesian 2010 63 61 97.0
11 Lower Silesian 2009 82 83 101.0
12 Lower Silesian 2008 88 86 98.0
13 Lower Silesian 2007 94 94 100.0
14 Lower Silesian 2006 84 84 100.0
15 Lower Silesian 2005 70 64 91.0
16 Lower Silesian 2004 107 105 98.0
17 Lower Silesian 2003 106 101 95.0
18 Lower Silesian 2002 119 109 92.0
19 Lower Silesian 2001 125 114 91.0
20 Lower Silesian 2000 134 123 92.0
21 Lower Silesian 1999 101 87 86.0
InĀ [17]:
# Some records are incorrect as detection_rate column includes values greater than 100.0, what is impossible.
# The number of crime_detected is higher than crime_registered in these cases. This requires switch of the vaues in columns.

crime.loc[crime['detection_rate'] > 100.00].count()
Out[17]:
administrative_unit    17
year                   17
crime_registered       17
crime_detected         17
detection_rate         17
dtype: int64
InĀ [18]:
crime.loc[crime['detection_rate'] > 100.00]
Out[18]:
administrative_unit year crime_registered crime_detected detection_rate
11 Lower Silesian 2009 82 83 101.0
34 Kuyavian-Pomeranian 2008 26 27 104.0
60 Lublin 2004 51 52 102.0
174 Opole 2000 23 24 104.0
179 Subcarpathian 2017 14 15 107.0
213 Podlaskie 2005 25 26 104.0
225 Pomerania 2015 28 29 104.0
231 Pomerania 2009 32 33 103.0
242 Silesian 2020 82 83 101.0
251 Silesian 2011 86 88 102.0
255 Silesian 2007 101 103 102.0
270 Świętokrzyskie 2014 4 5 125.0
273 Świętokrzyskie 2011 29 30 103.0
296 Warmian-Masurian 2010 25 26 104.0
302 Warmian-Masurian 2004 36 37 103.0
316 Greater Poland 2012 39 40 103.0
338 West Pomerania 2012 44 45 102.0
InĀ [19]:
# Switching the values within identified rows.

crime[['crime_registered','crime_detected']] =\
crime[['crime_detected','crime_registered']].where(crime['detection_rate'] > 100.00, 
                                                 crime[['crime_registered','crime_detected']].values)
InĀ [20]:
# Repeating the calculation for detection_rate with proper decimal points definition.

crime['detection_rate'] = (crime['crime_detected']/crime['crime_registered']*100).round()
crime.head()
Out[20]:
administrative_unit year crime_registered crime_detected detection_rate
0 Lower Silesian 2020 69 68 99.0
1 Lower Silesian 2019 33 33 100.0
2 Lower Silesian 2018 59 55 93.0
3 Lower Silesian 2017 52 51 98.0
4 Lower Silesian 2016 53 53 100.0
InĀ [21]:
# Checking on random record to verify the correctness.

crime[11:12]
Out[21]:
administrative_unit year crime_registered crime_detected detection_rate
11 Lower Silesian 2009 83 82 99.0
InĀ [22]:
# Checking all records in order to see whether there is still anything incorrect.

crime.loc[crime['detection_rate'] > 100.00].count()
Out[22]:
administrative_unit    0
year                   0
crime_registered       0
crime_detected         0
detection_rate         0
dtype: int64

Data grouping and verification¶

---- Back to top ----

InĀ [23]:
# Grouping crime by province and year. 16 provinces expected with 22 records from 1999-2020.

crime_groupby_province = crime.groupby('administrative_unit')
crime_groupby_province.ngroups
Out[23]:
16
InĀ [24]:
crime_groupby_province.size()
Out[24]:
administrative_unit
Greater Poland         22
Kuyavian-Pomeranian    22
Lesser Poland          22
Lodzkie                22
Lower Silesian         22
Lublin                 22
Lubusz                 22
Masovia                22
Opole                  22
Podlaskie              22
Pomerania              22
Silesian               22
Subcarpathian          22
Warmian-Masurian       22
West Pomerania         22
Świętokrzyskie         22
dtype: int64
InĀ [25]:
crime_groupby_year = crime.groupby('year')
InĀ [26]:
crime_groupby_year.ngroups
Out[26]:
22
InĀ [27]:
crime_groupby_year.size()
Out[27]:
year
1999    16
2000    16
2001    16
2002    16
2003    16
2004    16
2005    16
2006    16
2007    16
2008    16
2009    16
2010    16
2011    16
2012    16
2013    16
2014    16
2015    16
2016    16
2017    16
2018    16
2019    16
2020    16
dtype: int64

Overal analysis per province¶

CHARTS: 1.1, 1.2¶

---- Back to top ----

InĀ [28]:
# CHART: 1.1.
# Creating visual chart showing number of murders per each province and year (1999-2020).
# The general idea is to show overall trend over time and identify differences between locations.
# Adjustment of the chart details include: creating legend, descriptions, margins, overall look and feel.

plt.style.use('fivethirtyeight')
provinces = np.unique(crime['administrative_unit'])
colors = [plt.cm.viridis(i/float(len(provinces)-1)) for i in range(len(provinces))]
plt.figure(figsize = (22, 14))
for i, administrative_unit in enumerate(provinces):
    plt.scatter('year', 'crime_registered', 
                data = crime.loc[crime.administrative_unit == administrative_unit, :], 
                s = 60, 
                color = colors[i], 
                label = str(administrative_unit))
for j, administrative_unit in enumerate(provinces):
    plt.plot('year', 'crime_registered', 
             data = crime.loc[crime.administrative_unit == administrative_unit, :], 
             linewidth = 1, 
             color = colors[j], 
             label = "_nolegend_")
plt.title('Murders by province (1999-2020)',
          fontsize = 25,
          fontweight = 'bold',
          pad = 40)
plt.xlabel('Year', fontsize = 19,
           labelpad = 40,
           fontweight = 'bold')
plt.ylabel('Murders', 
           fontsize = 19, 
           labelpad = 40, 
           fontweight = 'bold')
plt.legend(fontsize = 20)  
plt.xticks(list(crime.year.unique()), 
           fontsize = 19)
plt.yticks(fontsize = 19)
plt.tight_layout()
plt.margins(0.1)
plt.show()
No description has been provided for this image
InĀ [29]:
# Summarizing data regarding the number of murders broken down by provinces.
# Creating separate dataframe in order ot use it in further analyses.

summary_crime = crime.groupby('administrative_unit', as_index = False)\
.aggregate({'crime_registered': 'sum', 'crime_detected': 'sum'})

summary_crime
Out[29]:
administrative_unit crime_registered crime_detected
0 Greater Poland 960 884
1 Kuyavian-Pomeranian 645 618
2 Lesser Poland 996 917
3 Lodzkie 1498 1428
4 Lower Silesian 1725 1641
5 Lublin 995 938
6 Lubusz 655 615
7 Masovia 2728 2323
8 Opole 319 304
9 Podlaskie 548 519
10 Pomerania 1009 933
11 Silesian 2198 2107
12 Subcarpathian 465 431
13 Warmian-Masurian 748 711
14 West Pomerania 1095 1023
15 Świętokrzyskie 491 453
InĀ [30]:
# CHART 1.2.
# Generating visual representation of the summary of the murders within 1999-2020, per province.

plt.style.use('fivethirtyeight')
color = plt.cm.winter(np.linspace(0, 1, 10))
crime.groupby('administrative_unit')\
.crime_registered.sum()\
.plot.bar(color = color, 
          figsize = (20, 11))
plt.title('Sum of murders per province (1999-2020)',
          fontsize = 24, 
          fontweight = 'bold', 
          pad = 40)
plt.xlabel('Province', 
           fontsize = 18, 
           labelpad = 30, 
           fontweight = 'bold')
plt.ylabel('Sum of murders', 
           fontsize = 18, 
           labelpad = 30, 
           fontweight = 'bold')
plt.xticks(rotation=90, fontsize=16)
plt.yticks(fontsize=16)
plt.tight_layout()
plt.margins(0.1)
plt.show()
No description has been provided for this image

Analysis of provinces share rate¶

CHART 2.1, MATRIX 3.1, 3.2¶

---- Back to top ----

InĀ [31]:
# Calcultaing the crime share rate during 1999-2020 per each province.
# Including these values in the sumamry_data as a seperate column / variable (share).

summary_Poland = sum(crime['crime_registered'])
share = (summary_crime['crime_registered'] / summary_Poland)*100
share.round()
Out[31]:
0      6.0
1      4.0
2      6.0
3      9.0
4     10.0
5      6.0
6      4.0
7     16.0
8      2.0
9      3.0
10     6.0
11    13.0
12     3.0
13     4.0
14     6.0
15     3.0
Name: crime_registered, dtype: float64
InĀ [32]:
summary_crime['share'] = share.round()
summary_crime
Out[32]:
administrative_unit crime_registered crime_detected share
0 Greater Poland 960 884 6.0
1 Kuyavian-Pomeranian 645 618 4.0
2 Lesser Poland 996 917 6.0
3 Lodzkie 1498 1428 9.0
4 Lower Silesian 1725 1641 10.0
5 Lublin 995 938 6.0
6 Lubusz 655 615 4.0
7 Masovia 2728 2323 16.0
8 Opole 319 304 2.0
9 Podlaskie 548 519 3.0
10 Pomerania 1009 933 6.0
11 Silesian 2198 2107 13.0
12 Subcarpathian 465 431 3.0
13 Warmian-Masurian 748 711 4.0
14 West Pomerania 1095 1023 6.0
15 Świętokrzyskie 491 453 3.0
InĀ [33]:
# CHART 2.1.
# Creatign visual representation as a pie chart showing share of each of the provinces in the overall number of murders.
# Thre of the provinces are distinguished: Masovia, Silesian, Lower Silesian.

plt.figure(figsize = (17,13))
labels = summary_crime['administrative_unit']
sizes = summary_crime['share']
explode = (0, 0, 0, 0, 0.1, 0, 0, 0.1, 0, 0, 0, 0.1, 0, 0, 0, 0)
color = plt.cm.winter(np.linspace(0, 2, 10))
plt.pie(sizes, 
        labels = labels, 
        explode = explode, 
        colors = color, 
        autopct = '%0.1f%%',
        textprops = dict(fontsize = 14), 
        wedgeprops = {'linewidth': 2.0, 
                      'edgecolor' : 'white'})
plt.title('Share rate of the provinces in murders overall (1999-2020)',
          fontsize = 22, 
          fontweight = 'bold', 
          pad = 10)
plt.axis('equal')
plt.tight_layout()
plt.margins(0.1)
plt.show()
No description has been provided for this image
InĀ [34]:
# Checking results of the highest and the lowest records each year.
# In order to observe trends per locations, choosing sample of 5 rows per category.
# Generating matrix view showing number of murders in each province and each year.

records = crime.loc[crime.groupby('year')['crime_registered'].nlargest(1).index.get_level_values(1)]
minimum = crime.loc[crime.groupby('year')['crime_registered'].nsmallest(1).index.get_level_values(1)]
records.head(), minimum.head()
Out[34]:
(    administrative_unit  year  crime_registered  crime_detected  \
 153             Masovia  1999               172             124   
 152             Masovia  2000               204             166   
 151             Masovia  2001               233             172   
 150             Masovia  2002               182             147   
 149             Masovia  2003               193             152   
 
      detection_rate  
 153            72.0  
 152            81.0  
 151            74.0  
 150            81.0  
 149            79.0  ,
     administrative_unit  year  crime_registered  crime_detected  \
 175               Opole  1999                23              21   
 174               Opole  2000                24              23   
 173               Opole  2001                21              20   
 172               Opole  2002                19              19   
 171               Opole  2003                19              18   
 
      detection_rate  
 175            91.0  
 174            96.0  
 173            95.0  
 172           100.0  
 171            95.0  )
InĀ [35]:
# MATRIX 3.1.
# Marking with 'gold' the greatest number of murders within each year. Masovia leads in the ranking.

matrix_max = crime.groupby(['administrative_unit', 'year'])['crime_registered'].sum().unstack()
matrix_max.style.highlight_max(color = 'gold', axis = 0)
Out[35]:
year 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
administrative_unit Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā 
Greater Poland 44 53 59 76 81 60 49 42 42 39 47 36 42 40 31 33 23 21 29 36 44 33
Kuyavian-Pomeranian 50 47 48 30 31 25 38 34 34 27 32 29 33 12 22 20 19 23 27 22 15 27
Lesser Poland 45 59 83 71 46 64 44 46 47 49 53 42 40 49 38 42 40 17 23 29 33 36
Lodzkie 64 108 119 123 98 98 88 81 87 75 56 64 58 43 51 50 42 28 36 44 41 44
Lower Silesian 101 134 125 119 106 107 70 84 94 88 83 63 61 62 70 41 51 53 52 59 33 69
Lublin 66 75 55 62 47 52 50 47 73 53 62 41 48 29 46 28 18 35 27 34 24 23
Lubusz 44 62 52 46 34 33 27 24 29 22 27 27 29 20 17 28 29 28 13 25 13 26
Masovia 172 204 233 182 193 147 154 126 155 123 119 106 100 90 70 74 63 68 92 80 70 107
Opole 23 24 21 19 19 13 13 16 15 14 14 17 10 17 10 9 15 8 9 12 13 8
Podlaskie 37 49 45 62 32 36 26 34 26 18 14 16 14 16 24 14 17 11 15 14 13 15
Pomerania 70 83 64 52 67 73 44 42 38 46 33 43 40 27 33 38 29 21 46 27 41 52
Silesian 145 157 188 131 116 138 107 119 103 89 99 90 88 78 69 67 77 69 51 63 71 83
Subcarpathian 25 37 38 26 27 19 21 26 12 26 26 14 14 16 28 25 6 15 15 16 22 11
Warmian-Masurian 53 52 60 73 53 37 41 44 27 29 40 26 21 26 18 18 19 22 25 17 23 24
West Pomerania 77 80 97 78 62 56 42 31 49 39 41 46 37 45 34 35 37 27 38 27 52 65
Świętokrzyskie 32 46 38 38 27 24 24 20 19 23 19 21 30 14 13 5 11 10 16 26 16 19
InĀ [36]:
# MATRIX 3.2.
# Marking with 'lime' the lowest number of murders within each year. Opole detected as the one which stands out.

matrix_min = crime.groupby(['administrative_unit', 'year'])['crime_registered'].sum().unstack()
matrix_min.style.highlight_min(color = 'lime', axis = 0)
Out[36]:
year 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
administrative_unit Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā  Ā 
Greater Poland 44 53 59 76 81 60 49 42 42 39 47 36 42 40 31 33 23 21 29 36 44 33
Kuyavian-Pomeranian 50 47 48 30 31 25 38 34 34 27 32 29 33 12 22 20 19 23 27 22 15 27
Lesser Poland 45 59 83 71 46 64 44 46 47 49 53 42 40 49 38 42 40 17 23 29 33 36
Lodzkie 64 108 119 123 98 98 88 81 87 75 56 64 58 43 51 50 42 28 36 44 41 44
Lower Silesian 101 134 125 119 106 107 70 84 94 88 83 63 61 62 70 41 51 53 52 59 33 69
Lublin 66 75 55 62 47 52 50 47 73 53 62 41 48 29 46 28 18 35 27 34 24 23
Lubusz 44 62 52 46 34 33 27 24 29 22 27 27 29 20 17 28 29 28 13 25 13 26
Masovia 172 204 233 182 193 147 154 126 155 123 119 106 100 90 70 74 63 68 92 80 70 107
Opole 23 24 21 19 19 13 13 16 15 14 14 17 10 17 10 9 15 8 9 12 13 8
Podlaskie 37 49 45 62 32 36 26 34 26 18 14 16 14 16 24 14 17 11 15 14 13 15
Pomerania 70 83 64 52 67 73 44 42 38 46 33 43 40 27 33 38 29 21 46 27 41 52
Silesian 145 157 188 131 116 138 107 119 103 89 99 90 88 78 69 67 77 69 51 63 71 83
Subcarpathian 25 37 38 26 27 19 21 26 12 26 26 14 14 16 28 25 6 15 15 16 22 11
Warmian-Masurian 53 52 60 73 53 37 41 44 27 29 40 26 21 26 18 18 19 22 25 17 23 24
West Pomerania 77 80 97 78 62 56 42 31 49 39 41 46 37 45 34 35 37 27 38 27 52 65
Świętokrzyskie 32 46 38 38 27 24 24 20 19 23 19 21 30 14 13 5 11 10 16 26 16 19

Analysis of maximal/minimal values and crime intensity¶

CHART 4.1, TABLES 5.1, 5.2, CHART 6.1¶

---- Back to top ----

InĀ [37]:
# Calculating maximum  and minimum values per province to see the most remarkable differences or potential anomalies.
# Merging the ruslts into one dataframe.

min_province = crime.groupby('administrative_unit').crime_registered.min()
record_province = crime.groupby('administrative_unit').crime_registered.max()
values_max_min = pd.merge(record_province, min_province, on = 'administrative_unit')
values_max_min.columns = ['maximum', 'minimum']
values_max_min
Out[37]:
maximum minimum
administrative_unit
Greater Poland 81 21
Kuyavian-Pomeranian 50 12
Lesser Poland 83 17
Lodzkie 123 28
Lower Silesian 134 33
Lublin 75 18
Lubusz 62 13
Masovia 233 63
Opole 24 8
Podlaskie 62 11
Pomerania 83 21
Silesian 188 51
Subcarpathian 38 6
Warmian-Masurian 73 17
West Pomerania 97 27
Świętokrzyskie 46 5
InĀ [38]:
# CHART 4.1.
# Generating visual representation showing minimum and maximum number of murders per year, within each province.
# Creating two layers, by combining line graph with scatter graph in order to see the overall trend and data points.

plt.style.use('fivethirtyeight')
plt.figure(figsize = (20, 12))
provinces = np.unique(crime['administrative_unit'])
plt.scatter(provinces, 
            record_province, 
            s = 150, 
            color = 'navy')
plt.scatter(provinces, 
            min_province, 
            s = 150, 
            color = 'lime')
plt.title('Maximum and minimum number of murders in provinces',
          fontsize = 24, 
          fontweight = 'bold', 
          pad = 40)
plt.xlabel('Province', 
           fontsize = 19, 
           labelpad = 20, 
           fontweight = 'bold')
plt.ylabel('Number of murders', 
           fontsize = 19, 
           labelpad = 40, 
           fontweight = 'bold')
plt.legend(['Maximum','Minimum'], 
           bbox_to_anchor = (0.95, 1.03), 
           fontsize = 19, 
           frameon = True, 
           facecolor = 'white', 
           edgecolor = 'black')
plt.plot(provinces, 
         record_province, 
         color = 'navy', 
         linewidth = 1)
plt.plot(provinces, 
         min_province, 
         color = 'lime', 
         linewidth = 1)
plt.xticks(rotation = 90, 
           fontsize = 18)
plt.yticks(fontsize = 18)
plt.tight_layout()
plt.margins(0.1)
plt.show()
No description has been provided for this image
InĀ [39]:
# TABLE 5.1.
# Veryfying the years with the highest records of murder crimes; marking out with 'gold'.
# Using 5 or more times in the ranking of the highest number per year to see the most intensive periods.

max_year =\
(crime.loc[crime.groupby('administrative_unit')['crime_registered']
         .nlargest(1)
         .index.get_level_values(1)])\
.groupby('year')\
.count()

def custom_style(row):
    color = 'white'
    if row.values[-1] >= 5:
        color = 'gold'
    return ['background-color: %s' % color]*len(row.values)

max_year.style.apply(custom_style, axis=1)
Out[39]:
Ā  administrative_unit crime_registered crime_detected detection_rate
year Ā  Ā  Ā  Ā 
1999 1 1 1 1
2000 6 6 6 6
2001 5 5 5 5
2002 3 3 3 3
2003 1 1 1 1
InĀ [40]:
# TABLE 5.2.
# Veryfying the years with the lowest records of murder crimes; marking out with 'lime'.
# Using 5 or more times in the ranking of the lowest number per year to see the most peaceful periods.

min_year =\
(crime.loc[crime.groupby('administrative_unit')['crime_registered']
         .nsmallest(1)
         .index.get_level_values(1)])\
.groupby('year')\
.count()

def custom_style(row):
    color = 'white'
    if row.values[-1] >= 5:
        color = 'chartreuse'
    return ['background-color: %s' % color]*len(row.values)

min_year.style.apply(custom_style, axis=1)
Out[40]:
Ā  administrative_unit crime_registered crime_detected detection_rate
year Ā  Ā  Ā  Ā 
2012 1 1 1 1
2014 1 1 1 1
2015 3 3 3 3
2016 5 5 5 5
2017 1 1 1 1
2018 2 2 2 2
2019 2 2 2 2
2020 1 1 1 1
InĀ [41]:
# CHART 6.1.
# Generating visual representation of showing minimum and maximum number of murders within each year.
# Creating two layers, by combining line graph with scatter graph in order to see the overall trend and data points.

plt.style.use('fivethirtyeight')
plt.figure(figsize = (20, 12))
year = records['year']
max_reg = records['crime_registered']
min_reg = minimum['crime_registered']
plt.scatter(year, 
            max_reg, 
            s = 150, 
            color = 'navy')
plt.scatter(year, 
            min_reg, 
            s=  150, 
            color = 'lime')
plt.title("aximum and minimum number of murders each year",
          fontsize = 26, 
          fontweight = 'bold', 
          pad = 40)
plt.xlabel('Year', 
           fontsize = 19, 
           labelpad = 40, 
           fontweight = 'bold')
plt.ylabel('Number of murders', 
           fontsize = 20, 
           labelpad = 40, 
           fontweight = 'bold')
plt.legend(['Maximum','Minimum'], 
           bbox_to_anchor = (0.93, 1.02), 
           fontsize = 20,
           frameon = True, 
           facecolor = 'white', 
           edgecolor = 'black')
plt.plot(year, 
         max_reg, 
         color = 'navy', 
         linewidth = 1)
plt.plot(year, 
         min_reg, 
         color = 'lime', 
         linewidth = 1)
plt.xticks(list(records.year), 
           fontsize = 18)
plt.yticks(fontsize = 18)
plt.tight_layout()
plt.margins(0.1)
plt.show()
No description has been provided for this image

Detection rate analysis¶

CHARTS 7.1, 7.2¶

---- Back to top ----

InĀ [42]:
# Summing data about crimes (murders) broken by year.
# Recalculating detection rate per each year and also the percentage of undetected crimes (100 minus detected rate).
# Creating separate dataframe in order to use it further in the analysis.

sum_years = crime.groupby('year').aggregate({'crime_registered': 'sum', 'crime_detected': 'sum'})
sum_years['detected'] = ((sum_years['crime_detected'] / sum_years['crime_registered'])*100).round()
sum_years['undetected'] = 100.00 - (sum_years['detected'])
sum_years
Out[42]:
crime_registered crime_detected detected undetected
year
1999 1048 903 86.0 14.0
2000 1270 1122 88.0 12.0
2001 1325 1171 88.0 12.0
2002 1188 1081 91.0 9.0
2003 1039 934 90.0 10.0
2004 982 911 93.0 7.0
2005 838 770 92.0 8.0
2006 816 762 93.0 7.0
2007 850 813 96.0 4.0
2008 760 719 95.0 5.0
2009 765 719 94.0 6.0
2010 681 633 93.0 7.0
2011 665 634 95.0 5.0
2012 584 555 95.0 5.0
2013 574 549 96.0 4.0
2014 527 506 96.0 4.0
2015 496 470 95.0 5.0
2016 456 440 96.0 4.0
2017 514 500 97.0 3.0
2018 531 511 96.0 4.0
2019 524 512 98.0 2.0
2020 642 630 98.0 2.0
InĀ [43]:
# CHART 7.1.
# Generating visual representation showing the level of crime/murder detection per each year.
# The chart gives clear view of detection rate improving over time, with decrease in the number of murders at the same time.

plt.style.use('fivethirtyeight')
plt.figure(figsize = (22, 12))
sum_reg = sum_years['crime_registered']
sum_det = sum_years['crime_detected']
sum_year = sorted(crime.year.unique())
indices = np.arange(len(sum_reg))
plt.bar(indices, 
        sum_reg, 
        width = 0.7, 
        color = 'navy', 
        label = 'Murders')
plt.bar(indices, 
        sum_det, 
        width = 0.1, 
        color = 'lime', 
        label = 'Detections')
plt.xticks(indices,
           ["{}".format(i) for i in (sum_year)], 
           fontsize = 19)
plt.yticks(fontsize = 19)
plt.title('Suma of detected and undetected murders per year',
          fontsize = 26, 
          fontweight = 'bold', 
          pad = 40)
plt.xlabel('Year', 
           fontsize = 22, 
           labelpad = 40, 
           fontweight = 'bold')
plt.ylabel('Number of murders', 
           fontsize = 22, 
           labelpad = 40, 
           fontweight = 'bold')
plt.legend(['Murders','Detections'], 
           bbox_to_anchor = (0.93, 1.02), 
           fontsize = 20,
           frameon = True, 
           facecolor = 'white', 
           edgecolor = 'black')
plt.tight_layout()
plt.margins(0.1)
plt.show()
No description has been provided for this image
InĀ [44]:
# CHART 7.2.
# Creating bar chart presenting proportion of the detected and undetected murders year after year.
# This confirms the trend of increasing detection rate.

plt.style.use('fivethirtyeight')
sum_years_detection = sum_years[['detected', 'undetected']].round()
ax = sum_years_detection.plot(kind = 'bar', 
                               stacked = True, 
                               width = 0.9, 
                               figsize = (20, 12), 
                               color = ['navy', 'magenta'])

for rect in ax.patches:
    height = rect.get_height()
    width = rect.get_width()
    x = rect.get_x()
    y = rect.get_y()
    label_text = f'{height}'  
    label_x = x + width / 2
    label_y = y + height / 2
    if height > 0:
        ax.text(label_x, 
                label_y, 
                label_text,
                ha = 'center', 
                va = 'center',
                fontsize = 14,
                fontweight = 'bold',
                color= 'white')

plt.title('Murders detection per year',
          fontsize = 24, 
          fontweight = 'bold', 
          pad = 40)
plt.xlabel('Year', 
           fontsize = 20, 
           labelpad = 40, 
           fontweight = 'bold')
plt.ylabel('Detection rate %',
           fontsize = 20,
           labelpad = 40,
           fontweight = 'bold')
plt.legend(['Detected','Undetected'],
           bbox_to_anchor = (0.85, 0.94),
           fontsize = 18, 
           frameon = True,
           facecolor = 'white',
           edgecolor = 'black')
plt.xticks(fontsize = 18)
plt.yticks(fontsize = 18)
plt.tight_layout()
plt.margins(0.1)
plt.show()
No description has been provided for this image

Differences between 2019 and 2020¶

CHARTS 8.1, 8.2¶

---- Back to top ----

InĀ [45]:
# CHART 8.1.
# Due to meaningful difference observed between 2019 and 2020, when the volume of crimes increases significantly,
# generating visual represantation showing sum of murders per each province and comparing data from the two periods.
# The sepearate data frame is being created in order to use this in further analysis.

plt.style.use('fivethirtyeight')
color = ('navy', 'lime')
year_2019_2020 = crime[crime['year'].isin([2019, 2020])].copy()
year_2019_2020["year"] = year_2019_2020["year"].astype(str) 

sns.catplot(x = 'crime_registered', 
            y = 'administrative_unit',
            data = year_2019_2020, 
            hue = 'year', 
            kind = 'bar',
            height = 11, 
            aspect = 2, 
            palette = color, 
            legend = True)
plt.title('Sum of murders in each province in 2019 and 2020',
          fontsize = 26, 
          fontweight = 'bold', 
          pad = 40)
plt.xlabel('Number of murders', 
           fontsize = 20, 
           labelpad = 30, 
           fontweight = 'bold')
plt.ylabel('Province', 
           fontsize = 20, 
           labelpad = 30, 
           fontweight = 'bold')
# plt.legend(bbox_to_anchor = (0.96, 1.01), 
#            fontsize = 18, 
#            frameon = True, 
#            facecolor = 'white', 
#            edgecolor ='black')
plt.yticks(fontsize = 18)
plt.xticks(fontsize = 18)
plt.tight_layout()
plt.margins(0.1)
plt.show()
No description has been provided for this image
InĀ [46]:
# Calculating percentage difference between 2019 and 2020 per each province.
# Pivoting the table and adding the newly calculated column as the new one in the dataframe.

year_2019_2020 = crime[crime['year'].isin([2019, 2020])]
diff_20_19 = year_2019_2020.pivot_table('crime_registered', ['administrative_unit'], 'year')
diff_20_19['percentage_diff'] = (((diff_20_19[2020] - diff_20_19[2019]) / diff_20_19[2019])*100).round()
diff_20_19
Out[46]:
year 2019 2020 percentage_diff
administrative_unit
Greater Poland 44.0 33.0 -25.0
Kuyavian-Pomeranian 15.0 27.0 80.0
Lesser Poland 33.0 36.0 9.0
Lodzkie 41.0 44.0 7.0
Lower Silesian 33.0 69.0 109.0
Lublin 24.0 23.0 -4.0
Lubusz 13.0 26.0 100.0
Masovia 70.0 107.0 53.0
Opole 13.0 8.0 -38.0
Podlaskie 13.0 15.0 15.0
Pomerania 41.0 52.0 27.0
Silesian 71.0 83.0 17.0
Subcarpathian 22.0 11.0 -50.0
Warmian-Masurian 23.0 24.0 4.0
West Pomerania 52.0 65.0 25.0
Świętokrzyskie 16.0 19.0 19.0
InĀ [Ā ]:
 
InĀ [47]:
# CHART 8.2.
# Generating tornado chart presenting difference in percentage in order to clearly pull apart the increases and decreases.
# Sorting the data by descending order to see the biggest incresease in first position.

plt.style.use('fivethirtyeight')
plt.figure(figsize = (20,15))
diff_20_19['colors'] = ['lime' if x < 0 else 'navy' for x in diff_20_19['percentage_diff']]
diff_20_19.sort_values('percentage_diff', 
                          inplace = True)
plt.hlines(y = diff_20_19.index, 
           xmin = 0, 
           xmax = diff_20_19.percentage_diff,
           color = diff_20_19.colors, 
           linewidth = 18)

for x, y, tex in zip(diff_20_19.percentage_diff, 
                     diff_20_19.index, 
                     diff_20_19.percentage_diff):
    plt.text(x, 
             y, 
             round(tex, 2), 
             horizontalalignment = 'right' if x < 0 else 'left', 
             verticalalignment = 'center', 
             fontdict = {'color':'black' if x < 0 else 'navy', 'size':17})

plt.title('Percentage difference in the volume of murders in 2020 vs 2019', 
          fontsize = 24, 
          fontweight = 'bold', 
          pad = 40)
plt.xlabel('Percentage', 
           fontsize = 19, 
           labelpad = 30, 
           fontweight = 'bold')
plt.ylabel('Province', 
           fontsize = 19, 
           labelpad = 30, 
           fontweight = 'bold')
plt.xticks(fontsize = 16)
plt.yticks(fontsize = 16)
plt.tight_layout()
plt.margins(0.1)
plt.show()
No description has been provided for this image

¶

Part 2: Correlation analysis for 2020¶

Data loading and structure verification¶

---- Back to top ----

InĀ [48]:
# Opening second data file, including population and area of each province in 2020.
# Basing on this portion of data, additional variables are being defined in order to explore correlation with crimes.

population = pd.read_csv('PATH/provinces-population-area-Poland-2020.csv', sep=',')
InĀ [49]:
# Veryfying data types as well as columns and rows structure.

population.dtypes
Out[49]:
province      object
population     int64
area_km2       int64
dtype: object
InĀ [50]:
population.shape
Out[50]:
(16, 3)
InĀ [51]:
# Viewing the data in the datafarme.

population
Out[51]:
province population area_km2
0 Lower Silesian 2901225 19947
1 Kuyavian-Pomeranian 2077775 17972
2 Lublin 2117619 25122
3 Lubusz 1014548 13988
4 Lodzkie 2466322 18219
5 Lesser Poland 3400577 15183
6 Masovia 5403412 35558
7 Opole 986506 9412
8 Subcarpathian 2129015 17846
9 Podlaskie 1181533 20187
10 Pomerania 2333523 18321
11 Silesian 4533565 12333
12 Świętokrzyskie 1241546 11711
13 Warmian-Masurian 1428983 24173
14 Greater Poland 3493969 29826
15 West Pomerania 1701030 22897
InĀ [52]:
# Displaying maximum, minimum and average values; verifying whether records count is the same per columns.
# Generated values are then used within grouping / segments definition.

population.describe()
Out[52]:
population area_km2
count 1.600000e+01 16.000000
mean 2.400697e+06 19543.437500
std 1.277056e+06 6836.338883
min 9.865060e+05 9412.000000
25% 1.382124e+06 14884.250000
50% 2.123317e+06 18270.000000
75% 3.026063e+06 23216.000000
max 5.403412e+06 35558.000000

Data segmentation¶

---- Back to top ----

InĀ [53]:
# Creating segments according to four categories of provinces population. 
# Data is used then to compare population segment with area segment to explore potential relationships.

def seg_population(population):
    if population < 1382000:
        return 'small'
    elif population >= 1382000 and population < 2123000:
        return 'medium'
    elif population >= 2123000 and population < 3026000:
        return 'big'
    else:
        return 'very big'
InĀ [54]:
seg_population(30)
Out[54]:
'small'
InĀ [55]:
population['population_segment'] = population['population'].apply(seg_population)
population.head()
Out[55]:
province population area_km2 population_segment
0 Lower Silesian 2901225 19947 big
1 Kuyavian-Pomeranian 2077775 17972 medium
2 Lublin 2117619 25122 medium
3 Lubusz 1014548 13988 small
4 Lodzkie 2466322 18219 big
InĀ [56]:
# Creating segments according to four categories of provinces area. 
# Data is used then to compare population segment with area segment to explore potential relationships.

def seg_area(area):
    if area < 14885:
        return 'small'
    elif area >= 14885 and area < 18271:
        return 'medium'
    elif area >= 18271 and area < 23217:
        return 'big'
    else:
        return 'very big'
InĀ [57]:
seg_area(30)
Out[57]:
'small'
InĀ [58]:
population['area_segment'] = population['area_km2'].apply(seg_area)
population.head()
Out[58]:
province population area_km2 population_segment area_segment
0 Lower Silesian 2901225 19947 big big
1 Kuyavian-Pomeranian 2077775 17972 medium medium
2 Lublin 2117619 25122 medium very big
3 Lubusz 1014548 13988 small small
4 Lodzkie 2466322 18219 big medium
InĀ [59]:
# Previewing segment with provinces with the biggest population, including descending sorting order.

population.loc[population['population_segment'] == 'very big']\
.reset_index()\
.sort_values(['population'], 
             ascending = False)
Out[59]:
index province population area_km2 population_segment area_segment
1 6 Masovia 5403412 35558 very big very big
2 11 Silesian 4533565 12333 very big small
3 14 Greater Poland 3493969 29826 very big very big
0 5 Lesser Poland 3400577 15183 very big medium
InĀ [60]:
# Grouping data related to registered and detected crimes in 2020: registered, detected, detections.
# Adding it to the dataset with population and area, broken by provinces, as seperate columns.

year_2020 = crime_groupby_year.get_group(2020)
population[['crime_registered', 'crime_detected', 'detection_rate']] =\
year_2020[['crime_registered', 'crime_detected', 'detection_rate']].values
population
Out[60]:
province population area_km2 population_segment area_segment crime_registered crime_detected detection_rate
0 Lower Silesian 2901225 19947 big big 69.0 68.0 99.0
1 Kuyavian-Pomeranian 2077775 17972 medium medium 27.0 25.0 93.0
2 Lublin 2117619 25122 medium very big 23.0 23.0 100.0
3 Lubusz 1014548 13988 small small 26.0 26.0 100.0
4 Lodzkie 2466322 18219 big medium 44.0 43.0 98.0
5 Lesser Poland 3400577 15183 very big medium 36.0 35.0 97.0
6 Masovia 5403412 35558 very big very big 107.0 105.0 98.0
7 Opole 986506 9412 small small 8.0 8.0 100.0
8 Subcarpathian 2129015 17846 big medium 11.0 11.0 100.0
9 Podlaskie 1181533 20187 small big 15.0 15.0 100.0
10 Pomerania 2333523 18321 big big 52.0 51.0 98.0
11 Silesian 4533565 12333 very big small 83.0 82.0 99.0
12 Świętokrzyskie 1241546 11711 small small 19.0 19.0 100.0
13 Warmian-Masurian 1428983 24173 medium very big 24.0 24.0 100.0
14 Greater Poland 3493969 29826 very big very big 33.0 30.0 91.0
15 West Pomerania 1701030 22897 medium big 65.0 65.0 100.0

Relationship analysis: murders, population, province area¶

CHART 9.1¶

---- Back to top ----

InĀ [61]:
# CHART 9.1.
# Generating interactive bubble chart visualization, presenting number of murders in each province in 2020.
# The chart is aimed at showing relation between population and area of the particular province.

pio.renderers.default = 'notebook'

plt.figure(figsize=(22, 18))
fig = px.scatter(population, 
                 x='population', 
                 y='area_km2',
                 color='province',
                 size='crime_registered', 
                 hover_data=['province'],
                 labels={
                     'population': 'Population in millions',
                     'area_km2': 'Area in km2',
                     'province': 'Province'}, 
                 title = '<b>Population and area of the provinces in 2020<b>',
                 width=1000, 
                 height=600)
fig.update_layout(
    title={
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'}, 
    font=dict(
        size = 13), 
    hoverlabel=dict(
        bgcolor='white',
        font_size=12))
fig.show()
<Figure size 2200x1800 with 0 Axes>

Calculation of the murders crime rate¶

---- Back to top ----

InĀ [62]:
# Calculationg crime rate using population variable. Multyplying the division result by 100 000.
# This provides information about number of crimes per 100.000 residents of the province.

crime_population_rate = (population['crime_registered'] / population['population']*100000).round(2)
crime_population_rate
Out[62]:
0     2.38
1     1.30
2     1.09
3     2.56
4     1.78
5     1.06
6     1.98
7     0.81
8     0.52
9     1.27
10    2.23
11    1.83
12    1.53
13    1.68
14    0.94
15    3.82
dtype: float64
InĀ [63]:
# Calculationg crime rate using area variable. Multyplying the division result by 1000.
# This provides information about number of crimes per 1000 km2 of the province.

crime_area_rate = (population['crime_registered'] / population['area_km2']*1000).round(2)
crime_area_rate
Out[63]:
0     3.46
1     1.50
2     0.92
3     1.86
4     2.42
5     2.37
6     3.01
7     0.85
8     0.62
9     0.74
10    2.84
11    6.73
12    1.62
13    0.99
14    1.11
15    2.84
dtype: float64
InĀ [64]:
# Additional calculation covers population density, with the usage of poulation and area variables.
# Density rate represents number of residents per 1 km2 of the province.

pop_density_rate = (population['population'] / population['area_km2']).round()
pop_density_rate
Out[64]:
0     145.0
1     116.0
2      84.0
3      73.0
4     135.0
5     224.0
6     152.0
7     105.0
8     119.0
9      59.0
10    127.0
11    368.0
12    106.0
13     59.0
14    117.0
15     74.0
dtype: float64
InĀ [65]:
# Adding all calculated data as seperate columns. This way the comparison is more intuitive and indicate some relations.
# The numbers reveal that the volume of murders comitted itself is not enough to judge about the danger level in province.

population['crime_100-000'] = crime_population_rate
population['crime_1000km2'] = crime_area_rate
population['population_density'] = pop_density_rate
population.reset_index().sort_values(['population'], ascending = False)
Out[65]:
index province population area_km2 population_segment area_segment crime_registered crime_detected detection_rate crime_100-000 crime_1000km2 population_density
6 6 Masovia 5403412 35558 very big very big 107.0 105.0 98.0 1.98 3.01 152.0
11 11 Silesian 4533565 12333 very big small 83.0 82.0 99.0 1.83 6.73 368.0
14 14 Greater Poland 3493969 29826 very big very big 33.0 30.0 91.0 0.94 1.11 117.0
5 5 Lesser Poland 3400577 15183 very big medium 36.0 35.0 97.0 1.06 2.37 224.0
0 0 Lower Silesian 2901225 19947 big big 69.0 68.0 99.0 2.38 3.46 145.0
4 4 Lodzkie 2466322 18219 big medium 44.0 43.0 98.0 1.78 2.42 135.0
10 10 Pomerania 2333523 18321 big big 52.0 51.0 98.0 2.23 2.84 127.0
8 8 Subcarpathian 2129015 17846 big medium 11.0 11.0 100.0 0.52 0.62 119.0
2 2 Lublin 2117619 25122 medium very big 23.0 23.0 100.0 1.09 0.92 84.0
1 1 Kuyavian-Pomeranian 2077775 17972 medium medium 27.0 25.0 93.0 1.30 1.50 116.0
15 15 West Pomerania 1701030 22897 medium big 65.0 65.0 100.0 3.82 2.84 74.0
13 13 Warmian-Masurian 1428983 24173 medium very big 24.0 24.0 100.0 1.68 0.99 59.0
12 12 Świętokrzyskie 1241546 11711 small small 19.0 19.0 100.0 1.53 1.62 106.0
9 9 Podlaskie 1181533 20187 small big 15.0 15.0 100.0 1.27 0.74 59.0
3 3 Lubusz 1014548 13988 small small 26.0 26.0 100.0 2.56 1.86 73.0
7 7 Opole 986506 9412 small small 8.0 8.0 100.0 0.81 0.85 105.0
InĀ [66]:
# Creating separate dataframe, with chosen variables to use within correlation exploration.
# The correlation analysis is focused in checking relationships between crime_registered and detection_rate (dependent)
# and set of newly defined variables (independent): population, area, population density.

population_select = population[['province', 'population', 'area_km2','crime_registered', 'crime_detected', 'detection_rate', 'population_density']]
population_select
Out[66]:
province population area_km2 crime_registered crime_detected detection_rate population_density
0 Lower Silesian 2901225 19947 69.0 68.0 99.0 145.0
1 Kuyavian-Pomeranian 2077775 17972 27.0 25.0 93.0 116.0
2 Lublin 2117619 25122 23.0 23.0 100.0 84.0
3 Lubusz 1014548 13988 26.0 26.0 100.0 73.0
4 Lodzkie 2466322 18219 44.0 43.0 98.0 135.0
5 Lesser Poland 3400577 15183 36.0 35.0 97.0 224.0
6 Masovia 5403412 35558 107.0 105.0 98.0 152.0
7 Opole 986506 9412 8.0 8.0 100.0 105.0
8 Subcarpathian 2129015 17846 11.0 11.0 100.0 119.0
9 Podlaskie 1181533 20187 15.0 15.0 100.0 59.0
10 Pomerania 2333523 18321 52.0 51.0 98.0 127.0
11 Silesian 4533565 12333 83.0 82.0 99.0 368.0
12 Świętokrzyskie 1241546 11711 19.0 19.0 100.0 106.0
13 Warmian-Masurian 1428983 24173 24.0 24.0 100.0 59.0
14 Greater Poland 3493969 29826 33.0 30.0 91.0 117.0
15 West Pomerania 1701030 22897 65.0 65.0 100.0 74.0

Correlation analysis with chosen variables¶

CHARTS 9.2, 9.3, 9.4¶

---- Back to top ----

InĀ [67]:
# CHART 9.2.
# Building correlation matrix in order to see the strongest relationship between the chosen variables.

population_corr = population_select.corr(numeric_only=True)

plt.figure(figsize=(11, 10), dpi=90)
sns.heatmap(
    population_corr,
    xticklabels=population_corr.columns,
    yticklabels=population_corr.columns,
    cmap="winter",
    center=0,
    annot=True
)
plt.title('Correlation map with chosen variables', fontsize=15, fontweight='bold', pad=40)
plt.xticks(fontsize=11, fontweight="bold")
plt.yticks(fontsize=11)
plt.tight_layout()
plt.show()
No description has been provided for this image
InĀ [68]:
# METHOD 1: genarating p-value mapping, marking with '*'
# Each variable is geeting marked in order to display if the range of 0.01-0.1 statistical significance is met.

from scipy import stats
from scipy.stats import pearsonr

# Limiting the attributes selection to numeric only.
population_num = population_select.select_dtypes(include=["number"])

rho = population_num.corr(method="pearson")
pval = population_num.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(*rho.shape)
p = pval.apply(lambda col: col.map(lambda x: ''.join(['*' for t in [0.01, 0.05, 0.1] if x <= t])))
rho.round(2).astype(str) + p
Out[68]:
population area_km2 crime_registered crime_detected detection_rate population_density
population 1.0*** 0.48* 0.8*** 0.79*** -0.37 0.7***
area_km2 0.48* 1.0*** 0.43* 0.42 -0.32 -0.25
crime_registered 0.8*** 0.43* 1.0*** 1.0*** -0.06 0.52**
crime_detected 0.79*** 0.42 1.0*** 1.0*** -0.03 0.52**
detection_rate -0.37 -0.32 -0.06 -0.03 1.0*** -0.13
population_density 0.7*** -0.25 0.52** 0.52** -0.13 1.0***
InĀ [69]:
# METHOD 2: getting p-values per each pair of the variables.

def pearsonr_pval(x,y):
        return pearsonr(x,y)[1]
    
corr = population_num.corr(method=pearsonr_pval)
corr
Out[69]:
population area_km2 crime_registered crime_detected detection_rate population_density
population 1.000000 0.060837 2.163474e-04 3.118559e-04 0.152622 0.002506
area_km2 0.060837 1.000000 9.802036e-02 1.071582e-01 0.222672 0.356061
crime_registered 0.000216 0.098020 1.000000e+00 1.137996e-22 0.823096 0.037853
crime_detected 0.000312 0.107158 1.137996e-22 1.000000e+00 0.908343 0.039344
detection_rate 0.152622 0.222672 8.230961e-01 9.083428e-01 1.000000 0.630942
population_density 0.002506 0.356061 3.785298e-02 3.934443e-02 0.630942 1.000000
InĀ [70]:
# METHOD 3: getting coefficient and p-values per chosen pairs of variables.

# Population
corr_pop = pd.DataFrame(columns=['r','p'])

for col in population_select:
        if pd.api.types.is_numeric_dtype(population_select[col]) and col != 'population':
            r, p = stats.pearsonr(population_select.population, population_select[col])
            corr_pop.loc[col] = [round(r,3), round(p,3)]
            
            
# Density
corr_den = pd.DataFrame(columns=['r','p'])

for col in population_select:
        if pd.api.types.is_numeric_dtype(population_select[col]) and col != 'population_density':
            r, p = stats.pearsonr(population_select.population_density, population_select[col])
            corr_den.loc[col] = [round(r,3), round(p,3)]
            
# Detection rate
corr_dtrate = pd.DataFrame(columns=['r','p'])

for col in population_select:
        if pd.api.types.is_numeric_dtype(population_select[col]) and col != 'detection_rate':
            r, p = stats.pearsonr(population_select.detection_rate, population_select[col])
            corr_dtrate.loc[col] = [round(r,3), round(p,3)]
            

print('Population', '\r\n', corr_pop, '\r\n')
print('\r\n', 'Density', '\r\n', corr_den, '\r\n')
print('\r\n', 'Detection rate', '\r\n', corr_dtrate)
Population 
                         r      p
area_km2            0.478  0.061
crime_registered    0.797  0.000
crime_detected      0.785  0.000
detection_rate     -0.375  0.153
population_density  0.701  0.003 


 Density 
                       r      p
population        0.701  0.003
area_km2         -0.247  0.356
crime_registered  0.523  0.038
crime_detected    0.519  0.039
detection_rate   -0.130  0.631 


 Detection rate 
                         r      p
population         -0.375  0.153
area_km2           -0.323  0.223
crime_registered   -0.061  0.823
crime_detected     -0.031  0.908
population_density -0.130  0.631
InĀ [71]:
# CHART 9.3.
# Veryfying distribution of the values by applying linear regression plot with chosen variables.
# According to the results given above, exploring the strongest positive correaltions (coefficient values): population, density.

plt.style.use('seaborn-v0_8')
sns.set(font_scale = 1.5)

sns.lmplot(x = 'population', 
           y = 'crime_registered', 
           data = population, 
           line_kws = {'color': 'orchid'}, 
           scatter_kws = {'color': 'navy'}, 
           height = 7, 
           aspect = 2)
plt.title('Number of murders against population (2020)', 
          fontsize = 16, 
          fontweight = 'bold', 
          pad = 20)
plt.xlabel('Population of the provinces', 
           fontsize = 17, 
           labelpad = 20)
plt.ylabel('Number of murders', 
           fontsize = 17, 
           labelpad = 20)
plt.xticks(fontsize = 13)
plt.yticks(fontsize = 13)
plt.tight_layout()
plt.margins(0.1)

sns.lmplot(x = 'population_density', 
           y = 'crime_registered', 
           data = population, 
           line_kws = {'color': 'lime'}, 
           scatter_kws = {'color': 'black'}, 
           height = 7, 
           aspect = 2)
plt.title('Number of murders against population_density (2020)', 
          fontsize = 16, 
          fontweight = 'bold', 
          pad = 20)
plt.xlabel('Population density: 1000 per km2', 
           fontsize = 17, 
           labelpad = 20)
plt.ylabel('Number of murders', 
           fontsize = 17, 
           labelpad = 20)
plt.xticks(fontsize = 13)
plt.yticks(fontsize = 13)
plt.tight_layout()
plt.margins(0.1)
plt.show
Out[71]:
<function matplotlib.pyplot.show(close=None, block=None)>
No description has been provided for this image
No description has been provided for this image
InĀ [72]:
# CHART 9.4.
# Veryfying distribution of the values by applying linear regression plot with chosen variables.
# According to the results given above, exploring the strongest negative correaltions (coefficient values): population, area.

plt.style.use('seaborn-v0_8')
sns.set(font_scale = 1.5)

sns.lmplot(x = 'detection_rate', 
           y = 'population', 
           data = population,
           line_kws = {'color': 'coral'}, 
           scatter_kws = {'color': 'darkgreen'}, 
           height = 7, 
           aspect = 2)
plt.title('Detection rate against population (2020)', 
          fontsize = 16, 
          fontweight = 'bold', 
          pad = 20)
plt.xlabel('Detection rate of murders', 
           fontsize = 17, 
           labelpad = 20)
plt.ylabel('Population of the provinces', 
           fontsize = 17, 
           labelpad = 20)
plt.xticks(fontsize = 13)
plt.yticks(fontsize = 13)
plt.tight_layout()
plt.margins(0.1)

sns.lmplot(x = 'detection_rate', 
           y = 'area_km2', 
           data = population,
           line_kws={'color': 'yellow'}, 
           scatter_kws = {'color': 'maroon'}, 
           height = 7, 
           aspect = 2)
plt.title('Detection rate against the area in km2 (2020)', 
          fontsize = 16, 
          fontweight = 'bold', 
          pad = 20)
plt.xlabel('Detection rate of murders', 
           fontsize = 17, 
           labelpad = 20)
plt.ylabel('Area of the provinces in km2',
           fontsize = 17, 
           labelpad = 20)
plt.xticks(fontsize = 13)
plt.yticks(fontsize = 13)
plt.tight_layout()
plt.margins(0.1)
plt.show
Out[72]:
<function matplotlib.pyplot.show(close=None, block=None)>
No description has been provided for this image
No description has been provided for this image
InĀ [73]:
# In order to verify the above results related to detection rate, in detail per province: 
# 50 first records with the highest detection rate are genearted and summed up per province. 
# These are the provinces with the smallest area or the ratio of population to area - equal to greater population density.

crime.nlargest(50, 'detection_rate').groupby('administrative_unit').count()
Out[73]:
year crime_registered crime_detected detection_rate
administrative_unit
Kuyavian-Pomeranian 11 11 11 11
Lesser Poland 4 4 4 4
Lodzkie 5 5 5 5
Lower Silesian 4 4 4 4
Lublin 5 5 5 5
Lubusz 7 7 7 7
Opole 10 10 10 10
Subcarpathian 4 4 4 4

---- Back to top ----